A rack of Bike Share bikes with the Cyclistic logo

Introduction

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make this possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders; customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Lily Moreno, Director of Marketing, believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a solid opportunity to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

Business task

This project will answer the following key question:

How do casual riders and annual members use Cyclistic bikes differently?

Supporting questions

To answer the key question, we will analyse the following supporting questions:

  • What proportion of the total rides come from casual riders?
  • What is the average trip duration for each type of rider?
  • Which days of the week are the most popular for each type of rider?
  • Which times of day are most popular with each type of rider?
  • Which rental stations are most popular with each type of rider?
  • Is the total number of rides trending up or down for each type of rider?
  • Which types of bike are most popular with each type of rider?
  • How many riders make round trips (where the start and end stations are the same)? Is one type of rider more likely to do this than the other?

Key stakeholders

The key stakeholders for this project are:

  • Director of Marketing
  • Marketing team
  • Cyclistic executive team

The key stakeholders are non-technical business leaders. This report will present findings in plain English and will be accompanied by an executive summary slide deck.

Prepare data

In this section, we describe the data sources used and how the data is organised, check for issues with bias or credibility, and verify the data’s integrity. We also confirm that the data can answer our questions.

Data sources

In order to answer the business question, this report uses Cyclistic’s historical trip data.1 This dataset contains the following information:

  • Ride ID
  • Type of bike
  • Start and end date and time of the ride
  • Start and end station name, ID and latitude/longitude
  • Rider type (member or casual)

For this analysis we downloaded data for the last 12 months from the source above. This analysis covers the period from 1 January 2024 to 31 December 2024.

Load data

We read the data for the last 12 months by running the R/1_read_data.R script from the project directory.

source(here::here("R/1_read_data.R"))

# View the head of the data
head(df)
## # A tibble: 6 × 15
##   filename         ride_id rideable_type started_at          ended_at           
##   <chr>            <chr>   <chr>         <dttm>              <dttm>             
## 1 /Users/claregib… C1D650… electric_bike 2024-01-12 15:30:27 2024-01-12 15:37:59
## 2 /Users/claregib… EECD38… electric_bike 2024-01-08 15:45:46 2024-01-08 15:52:59
## 3 /Users/claregib… F4A9CE… electric_bike 2024-01-27 12:27:19 2024-01-27 12:35:19
## 4 /Users/claregib… 0A0D9E… classic_bike  2024-01-29 16:26:17 2024-01-29 16:56:06
## 5 /Users/claregib… 33FFC9… classic_bike  2024-01-31 05:43:23 2024-01-31 06:09:35
## 6 /Users/claregib… C96080… classic_bike  2024-01-07 11:21:24 2024-01-07 11:30:03
## # ℹ 10 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## #   short_filename <chr>

Data profiling

We run code to produce a profile of each column within the data set.

skimr::skim_without_charts(df)
Data summary
Name df
Number of rows 5860568
Number of columns 15
_______________________
Column type frequency:
character 9
numeric 4
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
filename 0 1.00 128 128 0 12 0
ride_id 0 1.00 16 16 0 5860357 0
rideable_type 0 1.00 12 16 0 3 0
start_station_name 1073951 0.82 10 64 0 1808 0
start_station_id 1073951 0.82 3 35 0 1763 0
end_station_name 1104653 0.81 10 64 0 1815 0
end_station_id 1104653 0.81 3 35 0 1768 0
member_casual 0 1.00 6 6 0 2 0
short_filename 0 1.00 6 6 0 12 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
start_lat 0 1 41.90 0.04 41.64 41.88 41.90 41.93 42.07
start_lng 0 1 -87.65 0.03 -87.91 -87.66 -87.64 -87.63 -87.52
end_lat 7232 1 41.90 0.06 16.06 41.88 41.90 41.93 87.96
end_lng 7232 1 -87.65 0.11 -144.05 -87.66 -87.64 -87.63 152.53

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2024-01-01 00:00:39 2024-12-31 23:56:49 2024-07-22 20:36:16 5649600
ended_at 0 1 2024-01-01 00:04:20 2024-12-31 23:59:55 2024-07-22 20:53:59 5652165

From the data profile, we can see that more than one million records are missing start and end station data. We need to better understand where these missing records occur. First, let’s count the missing records by filename.

df |> 
  filter(is.na(start_station_id) | is.na(end_station_id)) |> 
  count(short_filename)
## # A tibble: 12 × 2
##    short_filename      n
##    <chr>           <int>
##  1 202401          31065
##  2 202402          38428
##  3 202403          71409
##  4 202404         117227
##  5 202405         167325
##  6 202406         216395
##  7 202407         208031
##  8 202408         214424
##  9 202409         284042
## 10 202410         167167
## 11 202411          89104
## 12 202412          47642

There is missing data within every file. Let’s check if the missing data is restricted to a specific bike type.

df |> 
  filter(is.na(start_station_id) | is.na(end_station_id)) |> 
  count(rideable_type)
## # A tibble: 3 × 2
##   rideable_type          n
##   <chr>              <int>
## 1 classic_bike        7489
## 2 electric_bike    1548260
## 3 electric_scooter   96510

Here we can see that most of the missing station data comes from electric bikes. We can check whether the latitude and longitude on the missing records can be matched to records with complete station data; if so, we can infer the missing station details from those records.

# How many stations are there for every distinct lat/long?
df |> 
  group_by(start_lat, start_lng) |> 
  mutate(stations = n_distinct(start_station_id)) |> 
  filter(stations > 1) |> 
  select(start_lat, start_lng, stations, start_station_id, start_station_name) |> 
  distinct() |> 
  arrange(start_station_id) |> 
  head(10)
## # A tibble: 10 × 5
## # Groups:   start_lat, start_lng [8]
##    start_lat start_lng stations start_station_id start_station_name             
##        <dbl>     <dbl>    <int> <chr>            <chr>                          
##  1      41.8     -87.7        2 1042             Public Rack - Laflin St &51st …
##  2      42.0     -87.8        2 1277.0           Public Rack - Peterson Park    
##  3      42.0     -87.8        2 1284.0           Public Rack - Forest Glen Stat…
##  4      41.9     -87.6        2 13001            Michigan Ave & Washington St   
##  5      41.9     -87.6        3 13001            Michigan Ave & Washington St   
##  6      41.9     -87.6        2 13001            Michigan Ave & Washington St   
##  7      41.9     -87.7        5 13001            Michigan Ave & Washington St   
##  8      41.9     -87.6        3 13006            LaSalle St & Washington St     
##  9      41.9     -87.7        2 13006            LaSalle St & Washington St     
## 10      41.9     -87.7        2 13008            Millennium Park
# How many lat/longs are there for every distinct station?
df |> 
  group_by(start_station_id) |> 
  mutate(lat_lngs = n_distinct(start_lat, start_lng)) |> 
  filter(lat_lngs > 1) |> 
  select(start_lat, start_lng, lat_lngs, start_station_id, start_station_name) |> 
  distinct() |> 
  arrange(start_lat) |> 
  head(10)
## # A tibble: 10 × 5
## # Groups:   start_station_id [2]
##    start_lat start_lng lat_lngs start_station_id start_station_name     
##        <dbl>     <dbl>    <int> <chr>            <chr>                  
##  1      41.6     -87.5      725 <NA>             <NA>                   
##  2      41.6     -87.6      725 <NA>             <NA>                   
##  3      41.6     -87.6      725 <NA>             <NA>                   
##  4      41.6     -87.5       24 20215            Hegewisch Metra Station
##  5      41.6     -87.5       24 20215            Hegewisch Metra Station
##  6      41.6     -87.5       24 20215            Hegewisch Metra Station
##  7      41.6     -87.5       24 20215            Hegewisch Metra Station
##  8      41.6     -87.5       24 20215            Hegewisch Metra Station
##  9      41.6     -87.5       24 20215            Hegewisch Metra Station
## 10      41.6     -87.5       24 20215            Hegewisch Metra Station

The code above shows a many-to-many relationship between station ID and lat/long: each station ID can have many lat/long values, and each lat/long value can map to many station IDs. This makes it difficult to infer the missing station data reliably, so we will exclude records with missing station data from the analysis of questions related to rental stations.
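
For those station-related questions, the exclusion could look like the following sketch (assuming dplyr is loaded; `df_stations` is a name introduced here for illustration, not one used by the analysis scripts):

```r
library(dplyr)

# For station-level questions, keep only rides with complete station data.
# `df` is the trip data loaded earlier.
df_stations <- df |>
  filter(
    !is.na(start_station_id),
    !is.na(end_station_id)
  )
```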

We can also review the station_id and station_name columns to ensure that we have consistency with the IDs and names. First, we set up a dataframe containing all of the distinct combinations of station_id and station_name.

start_stations <- 
  df |> 
  select(
    station_id = start_station_id,
    station_name = start_station_name
  ) |> 
  distinct()

end_stations <- 
  df |> 
  select(
    station_id = end_station_id,
    station_name = end_station_name
  ) |> 
  distinct()

stations <- 
  start_stations |> 
  bind_rows(end_stations) |> 
  distinct() |> 
  arrange(station_id)

We can use this dataframe to check that each station_id is associated with a single station_name.

stations |> 
  count(station_id) |> 
  arrange(desc(n)) |> 
  filter(n > 1)
## # A tibble: 97 × 2
##    station_id                              n
##    <chr>                               <int>
##  1 Hubbard Bike-checking (LBS-WH-TEST)     3
##  2 1524189                                 2
##  3 20129                                   2
##  4 21322                                   2
##  5 21366                                   2
##  6 21371                                   2
##  7 21379                                   2
##  8 21393                                   2
##  9 23114                                   2
## 10 23187                                   2
## # ℹ 87 more rows

There are 97 station_ids that are associated with multiple station_names. We’ll need to account for this in the data cleaning.

Next, we check whether each station_name is associated with a single station_id.

stations |> 
  count(station_name) |> 
  arrange(desc(n)) |> 
  filter(n > 1)
## # A tibble: 49 × 2
##    station_name                        n
##    <chr>                           <int>
##  1 Albany Ave & Douglas Blvd           2
##  2 Artesian Ave & 55th St              2
##  3 Burling St & Diversey Pkwy          2
##  4 California Ave & 36th St            2
##  5 Campbell Ave & 51st St              2
##  6 Central Park Ave & Douglas Blvd     2
##  7 Cicero Ave & Wellington Ave         2
##  8 Fairfield Ave & 44th St             2
##  9 Francisco Ave & 47th St             2
## 10 Harding Ave & 26th St               2
## # ℹ 39 more rows

There are 49 station names that are associated with multiple station IDs.
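
One possible way to move towards a one-to-one mapping during cleaning is to build a lookup of the most frequently observed name for each station ID. This is a sketch of that idea only (the names `canonical_stations` and `times_seen` are introduced here for illustration; the authoritative cleaning steps live in R/2_process_data.R):

```r
library(dplyr)

# Sketch: one canonical name per station ID, chosen as the name that
# appears most often in the raw trip data.
canonical_stations <- df |>
  filter(!is.na(start_station_id)) |>
  count(start_station_id, start_station_name, name = "times_seen") |>
  group_by(start_station_id) |>
  slice_max(times_seen, n = 1, with_ties = FALSE) |>
  ungroup() |>
  select(station_id = start_station_id, station_name = start_station_name)
```

A similar lookup in the other direction would be needed to resolve names that map to multiple IDs.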

We can also check the validity of the trip durations by calculating the number of minutes between the start and end times of each trip, and then reviewing the summary statistics for these values.

df |> 
  mutate(trip_duration = as.numeric(difftime(ended_at, started_at, units = "mins"))) |> 
  pull(trip_duration) |> 
  summary()
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -2748.317     5.545     9.717    17.316    17.250  1559.933

Here we can see that the middle 50% of trips last between roughly 5.5 and 17 minutes, but there are some outliers. The minimum value is -2,748 minutes, which doesn’t make sense: it means the end time is before the start time. We should plan to exclude any trips with a negative duration from our analysis.

df |> 
  mutate(trip_duration = as.numeric(difftime(ended_at, started_at, units = "mins"))) |> 
  filter(trip_duration < 0) |> 
  count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1   227

There are 227 records with a negative trip duration.
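
The planned exclusion could be sketched as follows (assuming dplyr; `df_valid` is an illustrative name, and the actual step is implemented in the processing script):

```r
library(dplyr)

# Compute trip duration in minutes and drop rides whose end time
# precedes their start time.
df_valid <- df |>
  mutate(trip_duration = as.numeric(difftime(ended_at, started_at, units = "mins"))) |>
  filter(trip_duration >= 0)
```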

Data integrity

The table below summarises our analysis of the integrity of this data set.

| Principle     | Passing threshold                                              | Pass/Fail | Comments                                                                                                               |
|---------------|----------------------------------------------------------------|-----------|------------------------------------------------------------------------------------------------------------------------|
| Reliable      | Accurate, complete, unbiased                                   | Fail      | Data has missing and inconsistent values for station IDs and locations; proceed with caution when using these fields    |
| Original      | Data comes from the original source                            | Pass      | Data is from the original source                                                                                        |
| Comprehensive | Contains all critical information needed to answer the question | Pass      | Data is provided for all trips, with the exception of the missing values noted above                                    |
| Current       | Current and relevant to the task at hand                       | Pass      | Data is provided up to and including last month                                                                         |
| Cited         | Data comes from a known and credible source                    | Pass      | Data is provided directly by Lyft, the company that operates the bike-share system                                      |

Based on our analysis, there are some reliability issues to address. Where a question relies on fields with missing values, the observations with missing data will first be excluded from the data set.

Process data

In this section, we document the steps taken to clean the data ready for analysis. To be ready for analysis, the data will:

  • Have a consistent and appropriate data type for each field
  • Be free of null values (using either an inferred value or a string to denote that the data is missing)
  • Have a one-to-one relationship between station IDs and station names
  • Exclude trips with a negative trip duration (where the end time occurs before the start time)
  • Include engineered features necessary to answer the business question
  • Be converted into a dimensional data model

The data processing steps for this analysis are performed in the R/2_process_data.R script.
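
As an illustration of the kind of feature engineering those steps involve, a sketch might look like the following (assuming dplyr and lubridate; the column names here are assumptions for illustration, and the authoritative steps are in R/2_process_data.R):

```r
library(dplyr)
library(lubridate)

# Illustrative engineered features for the supporting questions:
# trip duration, day of week, start hour, and a round-trip flag.
df_features <- df |>
  mutate(
    trip_duration = as.numeric(difftime(ended_at, started_at, units = "mins")),
    day_of_week   = wday(started_at, label = TRUE, week_start = 1),
    start_hour    = hour(started_at),
    is_round_trip = !is.na(start_station_id) &
      start_station_id == end_station_id
  )
```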


  1. Data is provided under licence from Lyft Bikes and Scooters, LLC↩︎